An Effective General Purpose Approach for Biomedical Document Classification
Abstract
Automated document classification can be a valuable tool for biomedical tasks that involve large amounts of text. However, in biomedicine, documents that have the desired properties are often rare, and special methods are usually required to address this issue. We propose and evaluate a method of classifying biomedical text documents, optimizing for utility when misclassification costs are highly asymmetric between the positive and negative classes. The method uses chi-square feature selection and several iterations of cost-proportionate rejection sampling, each followed by application of a support vector machine (SVM), and combines the resulting classifiers' predictions by voting. It is straightforward, fast, and achieves competitive performance on a set of standardized biomedical text classification evaluation tasks. The method is a good general purpose approach for classifying biomedical text.

Introduction

Text classification is the process of using automated techniques to assign text samples to one or more of a set of predefined classes. The text samples may be of any length, including abstracts, titles, sentences, and full text documents. The techniques used to accomplish this are based on machine-learning algorithms, and typically require a set of training data with known classifications on which to fit a model, which is then used to classify previously unseen data. This simple approach has many potential uses in biomedicine, including automated document triage for genomics database annotation, pre-filtering of search results to identify high quality articles for evidence-based medicine, identification of biological entities in free text, and reduction of the human labor needed to conduct systematic drug reviews. While biomedical document classification benefits from many of the existing developments in more general machine learning and text classification research, it also presents its own challenges.
Much of the computer science research focuses on evaluating effectiveness by measuring algorithm accuracy, the fraction of correct predictions. For biomedical text classification, many problems have only two classes: the positive class of documents that have the desired characteristics, and the negative class of documents that do not. In biomedical document classification the percentage of positive documents tends to be low. Measuring classification accuracy is not useful when 99% of documents are negative, because a classifier that labels every document negative is already 99% accurate. Furthermore, biomedical tasks typically assign unequal costs to missing a positive document versus mistakenly labeling a negative document as positive. In these scenarios, the positive documents tend to be rare, and mistakenly classifying any of them as negative is undesirable. Often, the metric of performance is utility, which is defined as:

$$U = u_r \times \text{true\_positives} + u_{nr} \times \text{false\_positives} \qquad (1)$$

where $u_r$ is the value of correctly predicting a positive document, and $u_{nr}$ is the (usually negative) value of incorrectly predicting a negative document as positive.

Finally, there are many text classification algorithms and approaches, and it is not always feasible to study and compare a large selection of algorithms to determine the best one to use. Tuning an algorithm's parameters for a given task may be costly in terms of time and training data. What is needed is a uniform, easily applied approach that takes into account the disparate costs of positive and negative classification mistakes and works well on a wide variety of tasks. Here we present our general approach to biomedical text classification. We show that this approach can be applied in a uniform manner, with minimal customization for the individual task. We then compare the performance of our method to previous methods used on the same test collection.
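To make the contrast between accuracy and the utility of equation (1) concrete, the following is a minimal Python sketch. The values `u_r = 20` and `u_nr = -1`, the function names, and the toy label lists are illustrative assumptions, not values taken from the text.

```python
def utility(y_true, y_pred, u_r, u_nr):
    """Utility as in equation (1): u_r * true positives + u_nr * false positives."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return u_r * true_pos + u_nr * false_pos


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical collection: 1 positive document among 100.
y_true = [1] + [0] * 99
all_negative = [0] * 100          # never flags anything
flag_ten = [1] * 10 + [0] * 90    # finds the positive, plus 9 false alarms

U_R, U_NR = 20, -1                # assumed weights, for illustration only

print(accuracy(y_true, all_negative), utility(y_true, all_negative, U_R, U_NR))
print(accuracy(y_true, flag_ten), utility(y_true, flag_ten, U_R, U_NR))
```

Under these assumed weights the classifier that never flags a document has the higher accuracy (0.99 versus 0.91) but the lower utility (0 versus 11), which is the behavior the utility metric is meant to expose.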
Methods

Our classification system uses word-based feature generation and chi-square feature selection, followed by a support vector machine (SVM) classifier wrapped with cost-sensitive resampling. We apply and test our method on the four document triage tasks from the TREC 2005 Genomics track data. This collection consists of training and test sets of full text documents. Each document has been labeled positive or negative for each of four document triage tasks performed by the Mouse Genome Informatics group (MGI at http://www.informatics.jax.org/) when reviewing journal articles for information about mouse gene alleles, embryological expression, GO annotation, and tumor biology.

Classification system: For each document in the training and test collections, the classification system generates a binary feature vector. For each document in the data set used here, the features we included were: 1) every word from the title and abstract, after removal of common English stop words and Porter stemming, 2) all assigned MeSH terms, and 3) MGI mouse gene identifiers assigned to the documents by an automated process. We also removed from the process any document that did not include the MeSH term Mice. This is not an essential step in our general method, but it is useful and easy to apply for organism-specific classification tasks such as those studied here. When classifying documents in the test collection, a document without the MeSH term Mice was predicted to be negative.

Each component in the binary feature vector is a one or a zero, signifying the presence or absence of a feature. This process may generate tens of thousands of features per data set, which can become unmanageable. We therefore reduce the feature set size by keeping only the features whose frequency differs significantly between the positive and negative training documents, using a chi-square test with a significance level (alpha) of 0.05.
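The feature selection and binary encoding described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the Pearson chi-square statistic on a 2x2 presence/absence table without continuity correction (the text does not say which variant was used), compares it against 3.841, the critical value for one degree of freedom at alpha = 0.05, and all function names and toy documents are hypothetical.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # feature present in all documents or in none
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, critical=CHI2_CRITICAL):
    """Keep features whose presence differs significantly between classes.

    docs is a list of feature sets; labels holds 1 (positive) or 0 (negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    selected = []
    for feature in sorted(set().union(*docs)):
        pos_with = sum(1 for d, y in zip(docs, labels) if y == 1 and feature in d)
        neg_with = sum(1 for d, y in zip(docs, labels) if y == 0 and feature in d)
        stat = chi_square_2x2(pos_with, n_pos - pos_with,
                              neg_with, n_neg - neg_with)
        if stat > critical:
            selected.append(feature)
    return selected


def to_binary_vector(doc, selected):
    """One component per selected feature: 1 if present in doc, else 0."""
    return [1 if feature in doc else 0 for feature in selected]


# Hypothetical training set: 5 positive and 5 negative documents.
docs = ([{"allele", "mice", "protein"}] * 2 + [{"allele", "mice"}] * 3
        + [{"mice", "protein"}] * 3 + [{"mice"}] * 2)
labels = [1] * 5 + [0] * 5

selected = select_features(docs, labels)
print(selected)
print(to_binary_vector({"allele", "tumor"}, selected))
```

In this toy set only "allele" passes the test: "mice" occurs in every document and "protein" is spread almost evenly across the two classes. With counts this small the chi-square approximation is rough; the example is meant only to show the mechanics.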
We then train an SVM-based classifier, SVMLight, on the labeled training document vectors. SVM classifiers have become very popular because of their speed, accuracy, and ability to generalize well by maximizing the margin between positive and negative training data. However, the theory on which SVMs are based optimizes for accuracy, which can be undesirable for biomedical text classification. SVMLight includes a parameter to trade off the cost of misclassifying positive and negative documents. However, both we and others have found this parameter to be ineffective for biomedical tasks. Instead, we account for asymmetric document misclassification costs using the document resampling method of Zadrozny, Langford, and Abe, which we summarize below. As far as we are aware, this method of accounting for asymmetric costs has not been applied to either general text or biomedical document classification. In simple terms, their theory states that an optimal error rate classifier for a distribution $D'$ is an optimal cost minimizer for another distribution $D_{test}$, and the two distributions are simply related by:
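In the cost-proportionate weighting formulation of Zadrozny, Langford, and Abe, that relation scales the probability of each example by its misclassification cost $c$ relative to the expected cost, $D'(x, y, c) = \frac{c}{E_{D_{test}}[c]} D_{test}(x, y, c)$; the notation is adapted to this section and should be read as an assumption rather than a quotation. Rejection sampling draws from $D'$ by keeping each training example with probability $c / Z$ for some $Z$ at least as large as the largest cost. The Python sketch below illustrates the resample, train, and vote loop under several assumptions: the cost values are hypothetical, the base learner `train_majority` is a trivial stand-in rather than an SVM, strict majority voting is one possible reading of "voting", and every function name is invented for illustration.

```python
import random


def rejection_sample(examples, costs, rng):
    """Cost-proportionate rejection sampling.

    Keeps each example with probability cost / Z, using Z = max(costs).
    """
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


def train_ensemble(examples, costs, train, n_rounds, seed=0):
    """Train one classifier per independently resampled training set."""
    rng = random.Random(seed)
    return [train(rejection_sample(examples, costs, rng))
            for _ in range(n_rounds)]


def vote(classifiers, x):
    """Predict 1 when a strict majority of classifiers predict 1, else 0."""
    positive_votes = sum(1 for clf in classifiers if clf(x) == 1)
    return 1 if 2 * positive_votes > len(classifiers) else 0


def train_majority(sample):
    """Hypothetical stand-in learner (NOT an SVM).

    Returns a classifier that always predicts the majority label of the
    sample it was trained on, ignoring its input.
    """
    positives = sum(label for _, label in sample)
    prediction = 1 if 2 * positives >= len(sample) else 0
    return lambda x: prediction


# Hypothetical data: 2 positives (cost 20 each), 40 negatives (cost 1 each).
examples = [({"allele"}, 1)] * 2 + [({"protein"}, 0)] * 40
costs = [20 if label == 1 else 1 for _, label in examples]

ensemble = train_ensemble(examples, costs, train_majority, n_rounds=5)
print(vote(ensemble, {"allele"}))
```

Because the positives carry the maximum cost they survive every round, while each negative is kept with probability 1/20, so each resampled training set is far more balanced than the original. In a real system `train` would wrap an actual SVM trainer.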